1 Overview

As a consequence of cheaper, smarter, and faster mobile devices and software, increasing numbers of people and businesses spend more time networking, shopping, learning, banking, and entertaining themselves on tablets and smartphones. Mobile OS platforms and companies such as SwiftKey have developed word-prediction features that have been shown to reduce keystrokes and spelling errors while improving the speed and accuracy with which users are able to text. This project aims to implement a lightweight, corpus-based word prediction capability that could, in principle, run on a mobile phone or tablet. Whereas many word prediction systems are probabilistic models based purely upon word frequencies, PredictifyR integrates an n-gram based probabilistic language model with syntactic (part-of-speech) and semantic information to improve predictive accuracy.

Figure 1: PredictifyR Project Schedule

2 Experimental Methods

The following experiments were designed to address four research questions. First, what is the effect of training set size on probabilistic language model prediction accuracy, model size, and runtime? Second, to what degree does the incorporation of syntactic information improve the predictive accuracy of probabilistic language models? Third, how does the integration of semantic knowledge affect probabilistic language model prediction performance? Lastly, how is prediction accuracy affected by linearly interpolating the probabilistic, syntactic, and semantic models?

This section serves three purposes: (1) outline the materials, hardware and software environment, toolsets, and natural language processing packages used during the project; (2) briefly establish the degree to which the corpus used on this project adequately represents the situational parameters, and therefore the linguistic features and lexical variation, of the language encountered within a mobile texting context; and (3) clearly outline the project’s seven-stage experimental methodology: (1) data acquisition, (2) data cleaning, (3) data sampling, (4) data analysis, (5) data processing, (6) predictive modeling, and (7) model evaluation. Models were evaluated intrinsically using the perplexity metric. The model with the lowest perplexity, given its size and runtime, would be implemented as a Shiny-based data product.
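Perplexity is the inverse probability of a test set, normalized by its length. As an illustrative sketch (not the project’s R code), the following computes the perplexity of a toy unigram model; the model and test sentence are made-up examples.

```python
import math

def perplexity(test_tokens, prob):
    """Perplexity = exp(-(1/N) * sum(log p(w))) over the N test tokens."""
    n = len(test_tokens)
    log_prob = sum(math.log(prob[w]) for w in test_tokens)
    return math.exp(-log_prob / n)

# Toy unigram model: four equally likely word types.
model = {"the": 0.25, "cat": 0.25, "sat": 0.25, "mat": 0.25}
test = ["the", "cat", "sat", "the", "mat"]
print(perplexity(test, model))  # approximately 4.0 (uniform over 4 types)
```

A uniform model over \(V\) types always has perplexity \(V\), which is why lower perplexity indicates a model that concentrates probability on the words actually observed.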

2.1 Materials

2.1.1 Software Environment

The project was implemented on a Windows x64-based laptop powered by an Intel Core i7-3610QM CPU @ 2.30GHz (4 cores, 8 logical processors) with 16.0 GB of installed memory, running the Microsoft Windows 10 Home operating system, version 10.0.14393 Build 14393. The scripts were authored using the 64-bit version of the R programming language, version 3.3.1 (The R Foundation 2015), within the RStudio version 0.99.903 (RStudio Team 2016) development environment. The complete list of R packages used can be found in Appendix B. The data.table package (Dowle, n.d.) served as the database for language modeling.

2.1.2 Natural Language Processing Software

Several Natural Language Processing (NLP) software packages provided essential functionality throughout the analysis, modeling, experimentation, and evaluation phases of the project. The quanteda package, version 0.9.9-3 (Benoit et al. 2016), provided tokenization, feature counting, and analysis functionality. Part-of-speech (POS) tagging was provided by the openNLP package, version 1.5.3 (Kottmann et al. 2016). The lsa package (Wild 2015) delivered the latent semantic analysis functionality. Lastly, Zipf analyses were conducted using the zipfR package (Evert 2015).

Source code is available and can be forked from https://github.com/j2scode/PredictifyR.

2.2 Data Acquisition

The corpus used for this project was obtained from the HC Corpora website, a collection of freely available texts comprised of over 2.5 billion words from 67 languages (Christensen 2016). The English language corpus, consisting of approximately 70 million words across three registers: news, blogs, and twitter feeds, was used for this project. The descriptive statistics are as follows:

Table 1: Raw Corpus Descriptive Statistics

Genre Size (MB) Sentences Tokens Words Types Mean Sentence Length Mean Word Length
Blogs 382.7 2,346,640 44,945,579 36,953,830 356,079 15.75 4.35
News 30.9 151,931 3,193,924 2,598,876 91,844 17.11 4.74
Twitter 325.4 3,758,014 37,216,902 29,513,317 433,439 7.85 4.22
Corpus 739.0 6,256,585 85,356,405 69,066,023 654,295 11.04 4.31

The following a priori qualitative analysis establishes the author’s reasons for accepting the corpus as qualitatively representative of the situations, linguistic features, and lexical variation to be encountered in a mobile word prediction context.

2.2.1 Blogs Register

The blogs register, comprising 2,346,640 sentences, 36,953,830 words, and 356,079 types, covered a range of topics addressed to a diverse, unenumerated public audience. Interaction ranged from minimal to extensive, depending upon the size of the audience and the nature of the content. Based upon available research, the typical blogger was likely 21-to-35 years of age, as this group accounted for some 53% of the total blogger population, followed by the under-21 age group, which constituted some 20% of the blogger population (Kennon 2015). Content ranged from factual to imaginative, and served to persuade, inform, entertain, instruct, explain, share opinions, and raise one’s social / professional profile. This register would be characterized by a wide continuum of formal and informal text, low to moderate use of slang, and a greater dispersion of lexical variety among the samples. Language models may have to accommodate liberal use of common internet or specialized content abbreviations.

2.2.2 News Register

The smallest of the registers, the online news content contained some 151,931 sentences, 2,598,876 words, and 91,844 word types. Likely produced in institutional or business settings, the news content could be characterized as factual, informative, opinion-based, and, with the exception of public opinion pages, non-interactive. Addressed to an unenumerated public, the news register would be characterized by greater formality, the use of standard English, few idioms and slang expressions, and a relatively high degree of lexical diversity. Statistical language models that capture the lexical diversity of the population are likely to perform well, as should topic-based or semantic language models.

2.2.3 Twitter Register

The twitter register consisted of so-called “tweets” - text messages of 140 characters or less - produced in largely personal-private settings on computers and on mobile phones. The addressees included the enumerated network of “friends”, the enumerated members of groups to which the “tweeter” subscribed, and, since messages could be “retweeted”, the unenumerated public. According to Duggan (2015), 23% of internet users used twitter in 2015; the 18-29 age group represented the most prolific user group, followed by the 30-49 age group. Thirty-two percent of internet users age 18-29 (40% of all twitter users) and 29% of those age 30-49 (36% of all twitter users) used twitter (Duggan 2015). Consisting of some 3,758,014 sentences, 29,513,317 words, and 433,439 types, tweets tended to be opinion oriented with occasional supporting facts and served a range of purposes: meeting new people from various backgrounds and locations, establishing and maintaining connections with others, growing friendships, and raising one’s social media profile. Twitter content was likely to be characterized by informality, greater use of idioms and slang, non-standard spelling, and a lower degree of lexical richness.

Combined, these registers reflect the situational and therefore, linguistic diversity representative of the population of content for which word prediction functionality would be used.

2.2.4 Reference Data

In addition, supporting reference data used in this project included a list of commonly used English language contractions (Wikipedia 2016), a list of profane words courtesy of Google’s What Do You Love Project (Google 2012), a file of common internet and text messaging abbreviations obtained from Wiktionary (Wiktionary 2016), and a collection of emoticons compiled from Internet sources (Cool-Smileys 2010). Each source was manually sanitized and conditioned to eliminate ambiguity, profanity, and conflicts. All reference files are freely available as csv files and can be downloaded from the data/referenceData subdirectory at https://github.com/j2scode/PredictifyR.

2.3 Data Cleaning

The corpus was downloaded from the HC Corpora website (Christensen 2016), reshaped into sentences (the raw corpus was organized by line, some lines containing multiple sentences), and audited for data quality issues. The summary in Table 2 reveals several anomalies that were addressed during the data cleaning process.

Table 2: Raw Corpus Audit

Register Size (MB) Sentences Tokens Words Types Mean Word Length Max Word Length % Special Chars % Contractions % Abbreviations % Profanity
Blogs 382.7 2,346,640 44,945,579 36,953,830 356,079 4.35 91 11.10 0.54 0.06 0.10
News 30.9 151,931 3,193,924 2,598,876 91,844 4.74 34 4.17 0.70 0.04 0.02
Twitter 325.4 3,758,014 37,216,902 29,513,317 433,439 4.22 120 2.20 1.36 0.58 0.30
Corpus 739.0 6,256,585 85,356,405 69,066,023 654,295 4.31 120 6.96 0.90 0.28 0.19

The corpus contained some 85,356,000 tokens, of which approximately 81% were words. Special characters (about 7% of the corpus), punctuation, symbols, and digit characters made up the rest. An inspection of words exceeding 40 characters in length revealed repeated string patterns, urls, email addresses, and various anomalies that could be eliminated in preprocessing. Somewhat surprisingly, contractions, common abbreviations, and profanity made up less than 2% of the corpus.

During the cleaning process, the raw data were reshaped into sentences using the quanteda package (Benoit et al. 2016) and converted to lower case. Non UTF-8 encoding was corrected and special, control, non-printable, and non-ASCII characters were removed. Approximately 230 common misspellings, contractions, and abbreviations, were normalized. Email addresses, urls, and twitter hashtags were excised from the text. Hyphenated words were split and digits, symbols and punctuation (with the exception of the apostrophe not surrounded by whitespace, periods, exclamation points, and question marks) were extracted. Lastly, profanity and words exceeding 40 characters in length were eliminated. Stop words, made up largely of function words, were retained. Due to the size of the corpus and the processing speeds of available spell checkers, automated spell checking and correction was not performed.
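The cleaning steps above can be sketched with regular expressions. This is a simplified Python illustration rather than the project’s actual quanteda/R pipeline, and the specific patterns are assumptions about one reasonable implementation of the rules described.

```python
import re

def clean_sentence(text, max_word_len=40):
    text = text.lower()
    # Excise urls, email addresses, and twitter hashtags.
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"#\w+", " ", text)
    # Split hyphenated words.
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)
    # Drop digits, symbols, and punctuation, keeping sentence-final
    # marks and apostrophes inside words.
    text = re.sub(r"[^a-z.!?' ]", " ", text)
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)
    # Eliminate words exceeding the maximum length.
    words = [w for w in text.split() if len(w) <= max_word_len]
    return " ".join(words)

print(clean_sentence("Don't email me@example.com - see https://example.com #spam!"))
```

Note that, as in the project, contractions such as "don't" survive because the internal apostrophe is retained, while free-standing apostrophes are stripped.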

The clean corpus descriptive statistics are as follows.

Table 3: Clean Corpus Descriptive Statistics

Genre Size (MB) Sentences Tokens Words Types Mean Sentence Length Mean Word Length
Blogs 311.5 2,315,485 37,110,484 37,110,484 272,219 16.03 4.38
News 24.9 150,118 2,608,731 2,608,731 75,667 17.38 4.73
Twitter 248.6 3,702,327 29,354,830 29,354,830 255,494 7.93 4.16
Corpus 585.0 6,167,930 69,074,045 69,074,045 429,412 11.20 4.30

Weighing in at about 585 MB, the corpus contains a vocabulary of 429,412 words. An N-gram language model based upon this corpus, with \(V^2\) = 184,394,665,744 possible bigrams and \(V^3 \approx 7.9\times10^{16}\) possible trigrams, would approach memory capacity, especially for a mobile device. The following section outlines the design process for a compact, yet representative corpus.

2.4 Data Sampling

An intrinsic evaluation of any language model requires a training corpus, a separate development test set, and an unseen test set for final model evaluation. Typically, 80% of the data are designated for training, 10% for development, and 10% for testing. However, training on a large corpus such as the HC Corpora, with a vocabulary \(V\) of some 654,295 types, can present practical and performance problems. Chiefly, the N-gram probability matrices can be unwieldy, exceedingly sparse, and computationally expensive. There would be \(V^2\) = 428,101,947,025 possible bigrams and \(V^3 \approx 2.8\times10^{17}\) possible trigrams. A prediction model that could run on a mobile device must balance potentially lower test set perplexity (higher accuracy) against computational and response-time performance considerations. As such, intelligent sampling from the master corpus must produce a language model corpus which maximizes representativeness and minimizes size.
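The n-gram state-space figures quoted above and in the previous section follow directly from the vocabulary sizes, as this quick check confirms (an illustrative Python calculation, not project code):

```python
# Vocabulary sizes quoted in the text: raw HC Corpus vs. cleaned corpus.
V_raw, V_clean = 654_295, 429_412

print(V_raw ** 2)    # 428,101,947,025 possible bigrams (raw corpus)
print(V_raw ** 3)    # ~2.80105e+17 possible trigrams
print(V_clean ** 2)  # 184,394,665,744 possible bigrams (clean corpus)
```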

Representativeness, in the context of this project, has a two part definition. First, the lexical resource should represent the range of lexical diversity in the HC Corpus. That is, the number of out-of-vocabulary (OOV) words in the lexical resource should be below some established threshold. The target for this project was 95% coverage, or no more than 5% OOV tokens. Second, the lexical resource should represent the distribution of linguistic features or text types in the HC Corpus.

Thus, the aim at this stage was to create a model corpus which was computationally manageable and was representative according to the above definition. Two analyses were undertaken to address the question of representativeness. The lexical diversity analysis applied Zipf’s law and word frequency distribution analysis to the problem of language coverage and lexical size. The lexical feature analysis examined the distribution of lexical features to ascertain optimal lexical size.

2.4.1 Lexical Diversity Analysis

Here, Zipf’s law and word frequency distribution analysis yielded an estimate of optimal lexical size given the coverage target. This section includes a review of Zipf’s law, an overview of how it is used to estimate lexical size, an evaluation of goodness-of-fit, a summary of the Zipf-Mandelbrot model, and the lexical size estimate. Finally, the sample corpus is created and the OOV rate, vis-a-vis the HC Corpus, is verified.

2.4.1.1 Zipf’s Law

Popularized by American linguist George Kingsley Zipf, Zipf’s law describes one of the most basic, yet puzzling, aspects of human language: words occur according to a mathematically simple, yet systematic, frequency distribution whereby a few very high-frequency words account for most of the tokens in a text. This strikingly simple distribution follows a power law known as Zipf’s law, which states that the \(r\)th most frequent word has a frequency \(f(r)\) that scales according to \[f(r)\propto{\frac{1}{r^{\alpha}}}\] where \(\alpha\approx1\) (George Kingsley Zipf 1935). Roughly speaking, the frequency with which a word occurs is inversely proportional to its rank, such that the \(n\)th most common word will occur with a frequency \(1/n\) that of the highest frequency word. So, the second most common word in a natural corpus will occur \(1/2\) as often as the most common, the third most common word will occur \(1/3\) as often as the most frequent, and so on. This word frequency distribution is commonly referred to as a “Zipfian” distribution.
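The \(1/n\) pattern can be made concrete with a few lines of code (an illustrative sketch; the top-rank frequency of 60,000 is an arbitrary example value):

```python
def zipf_frequency(rank, top_freq, alpha=1.0):
    """Zipf's law: the r-th ranked word occurs with a frequency
    proportional to 1 / r**alpha."""
    return top_freq / rank ** alpha

# With alpha = 1, the 2nd word occurs half as often as the 1st,
# the 3rd a third as often, and so on.
print(zipf_frequency(2, 60_000))  # 30000.0
print(zipf_frequency(3, 60_000))  # 20000.0
```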

Assuming a corpus word frequency distribution follows a Zipfian distribution, these characteristics can be used to define optimal properties such as corpus sample and lexical sizes. In fact, it can be shown that a high degree of vocabulary coverage can be achieved with a relatively small sample of a corpus. To illustrate, the following frequency spectrum for the blogs register summarizes the word frequency distribution in terms of the number of types (\(Vm\)) per frequency class (\(m\)). It reports how many distinct types occur once, twice, and so on.

Figure 2: Frequency spectrum for blogs register

2.4.1.2 Applying Zipf’s Law to Lexical Size Problem

Two additional columns have been added: \(V\), the cumulative vocabulary size, and \(N\), the cumulative number of words by frequency class. To determine the number of types required to achieve 95% coverage, take \(N=\) 37,110,484, the number of words in the blogs register, and \(V=\) 280,731, the blogs register vocabulary size. The corresponding OOV rate of 5% equates to \(Noov=\) 1,855,524 words. According to the frequency spectrum, the OOV rate is reached within the top 113 frequency classes. Subtracting the corresponding \(Voov=\) 265,927 from the total vocabulary size \(V=\) 280,731 yields 14,804, the number of types that the lexical resource must include in order to achieve 95% token coverage. The corresponding number of tokens can be obtained from a vocabulary growth curve, which relates vocabulary size to the number of word tokens.
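The arithmetic behind this estimate can be checked directly (an illustrative Python check of the numbers above, not project code):

```python
N = 37_110_484   # tokens in the blogs register
V = 280_731      # blogs register vocabulary size
V_oov = 265_927  # types falling in the OOV frequency classes

# 5% of the tokens may be out of vocabulary at the 95% coverage target.
N_oov = int(0.05 * N)
lexicon_size = V - V_oov

print(N_oov)         # 1,855,524 tokens may be out of vocabulary
print(lexicon_size)  # 14,804 types needed for 95% token coverage
```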

The next step in the analysis was to ascertain the degree to which Zipf’s law, in its classic form, could be applied as above to determine optimal lexical size.

2.4.1.3 Zipfian Distribution Goodness-of-Fit

The Zipf Plot, which relates log rank to log frequency, provides a good visual indicator of goodness-of-fit. The data conform to Zipf’s law to the extent that the plot is linear. The following plots graphically characterize goodness-of-fit for each register.

Figure 3: Zipfian Distribution Goodness of Fit

The curvilinear line shapes indicate systematic variation from a Zipfian distribution. The following residual plots confirm problems with goodness-of-fit.

Figure 4: Zipf Residual Plots

This result was not surprising. In a large-scale analysis of over 30,000 texts in the Project Gutenberg database, goodness-of-fit tests showed that only 15% of the texts were compatible with this form of Zipf’s law (Moreno-Sanchez, Font-Clos, and Corral 2016). The reason is that language is creative and evolving, characterized by a very large, perhaps infinite, number of types, and word frequency distributions are characterized by a large number of low probability events (Khmaladze 1988). Such distributions, referred to as LNRE (large number of rare events) distributions, contain rare events that occur in the population but do not occur in the sample, regardless of its size. Since the joint probability of the events seen in the sample sums to 1, it must be adjusted in order to free probability space for the unseen types (Baayen 2008). Fortunately, LNRE models have been developed to adjust for the large probability of unseen events.

2.4.1.4 Zipf-Mandelbrot Model

One such LNRE model is based upon the following generalization of Zipf’s law proposed by Benoit Mandelbrot (Mandelbrot 1961).

\[f(r) \propto \frac{1}{(r+\beta)^\alpha}\] By “shifting” the rank by an amount \(\beta\), the Zipf-Mandelbrot (ZM) formulation more closely fits the frequency distributions found in language (Zipf 1936; Zipf 1949; Mandelbrot 1953; Mandelbrot 1961). Using the zipfR package (Evert 2015), ZM models were independently fit to each register of the corpus. Parameters \(\alpha\) and \(\beta\) were estimated as follows:

Table 4: Zipf-Mandelbrot Model Parameters by Register

Register Alpha Beta
Blogs 0.4839 0.0011
News 0.5020 0.0014
Twitter 0.5257 0.0021
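For illustration, the shifted power law can be evaluated with the fitted blogs parameters from Table 4 (a Python sketch of the formula only; the project’s actual fitting used zipfR’s LNRE models, and the constant of proportionality is omitted):

```python
def zm_frequency(rank, alpha, beta):
    """Zipf-Mandelbrot: f(r) proportional to 1 / (rank + beta)**alpha."""
    return 1.0 / (rank + beta) ** alpha

# Blogs register parameters from Table 4.
alpha, beta = 0.4839, 0.0011

f = [zm_frequency(r, alpha, beta) for r in (1, 2, 3, 100)]
assert all(a > b for a, b in zip(f, f[1:]))  # frequency falls with rank

# With alpha < 1 the head of the distribution is flatter than classic
# Zipf, where f(1)/f(2) would equal 2.
print(f[0] / f[1])
```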

Frequency Spectra

Frequency spectra are the essential data structures used to estimate the required lexical vocabulary size to meet a language coverage target. Expected frequency spectra were estimated using the ZM model and are juxtaposed against the observed word frequency distributions from the corpus.

Figure 5: Zipf-Mandelbrot Expected vs. Observed Frequency Spectra

The observed and expected frequency spectra appear to fit very well. Next, the observed (smoothed) and extrapolated vocabulary growth curves are evaluated for fit.

Vocabulary Growth Curves

Vocabulary growth curves (VGCs) relate vocabulary size \(V\) to sample size \(N\). Expected VGCs were extrapolated from observed VGCs using the ZM model and are plotted vis-a-vis smoothed (interpolated) observed VGCs.

Figure 6: Zipf-Mandelbrot Expected vs. Observed Vocabulary Growth Curves (Blogs)

Figure 7: Zipf-Mandelbrot Expected vs. Observed Vocabulary Growth Curves (News)

Figure 8: Zipf-Mandelbrot Expected vs. Observed Vocabulary Growth Curves (Twitter)

Both spectra and vocabulary growth curves fit the data reasonably well. Now, the lexical size can be estimated.

2.4.1.5 Lexical Size Estimate

The lexical size estimates were obtained by comparing the observed vocabulary sizes \(V\) of the corpus registers at various sample sizes with the ZM model-derived expected vocabulary \(EV\) of the HC Corpus registers. Using the zipfR package (Evert 2015), frequency spectra were created for each register at varying sample sizes. Based upon these observed spectra, ZM models were trained and an expected vocabulary size \(EV\) for the full corpus register was estimated. Out-of-vocabulary word types \(Voov\) were calculated by subtracting the observed vocabulary \(V\) from the ZM model expected vocabulary \(EV\). Utilizing the ZM model-derived frequency spectrum, the associated number of out-of-vocabulary tokens \(Noov\) was determined by cumulatively summing the product of the frequency classes \(m\) and the vocabulary \(Vm\) until \(Voov\) was reached. The out-of-vocabulary rate (OOV Rate) was obtained by dividing \(Noov\) by the number of tokens in the HC Corpus register. The coverage rates at various sample sizes are summarized below.

Figure 9: Lexical coverage by sample size
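The cumulative \(Noov\) procedure described above can be sketched as follows (a toy Python illustration of the logic, using a made-up frequency spectrum rather than the fitted ZM spectra):

```python
def oov_tokens(spectrum, v_oov):
    """Sum m * Vm over the rarest frequency classes until the
    out-of-vocabulary type count v_oov is reached.

    spectrum: list of (m, Vm) pairs, i.e. Vm types occurring m times.
    """
    n_oov, types_needed = 0, v_oov
    for m, vm in sorted(spectrum):          # rarest classes first
        take = min(vm, types_needed)
        n_oov += m * take
        types_needed -= take
        if types_needed == 0:
            break
    return n_oov

# Toy spectrum: 100 hapaxes, 50 types occurring twice, 20 thrice.
spectrum = [(1, 100), (2, 50), (3, 20)]
print(oov_tokens(spectrum, 120))  # 100*1 + 20*2 = 140 tokens
```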

Two observations stood out. First, a high degree of lexical coverage could be obtained with relatively small samples from the HC Corpora. Second, the required sample size grows dramatically with increased coverage: on average, each additional point of coverage required substantially more tokens to be added to the lexicon. Notwithstanding, the coverage target of 95% was increased slightly, given the general tendency of LNRE models to underestimate vocabulary sizes in extrapolation (Evert and Baroni 2005). The following summarizes the lexical diversity-based sample size estimates.

Table 5: Lexical diversity-based sample size estimate

Register % Tokens Sample Tokens Sample V Extrapolated V Estimated Voov Estimated Noov OOV Rate (%) Coverage (%)
Blogs 3 1,113,314 50,963 365,182 314,219 627,494 2 98
News 15 391,309 31,765 97,569 65,804 81,288 3 97
Twitter 3 880,644 41,194 309,381 268,187 527,587 2 98
Corpus 7 2,385,267 123,922 772,132 648,210 1,236,369 2 98

The analysis shows that an average 98% coverage rate could be achieved with a lexicon of 4,770,534 tokens, 7 percent of the size of the HC Corpus.

Next, the distribution of lexical features will be examined vis-a-vis the sample size estimation problem.

2.4.2 Lexical Feature Analysis

Corpus representativeness, Biber (1993) argues, also depends upon “the extent to which it contains the range of linguistic distributions in the population”. Thus, this analysis examines the distribution of lexical features vis-a-vis optimal lexical size in terms of the size of the sampling unit, the number of sampling units in the overall lexical resource, and the stratified representation of each register within the corpus. To capture the corpus’ linguistic nuances, the following 10 lexical features commonly used in variation studies were examined:

Table 6: Descriptive Statistics for frequency scores (per 2,000 words)

Tag Description Min Max Mean Range Total % Total SD Variation Coefficient Tolerable Error Sample Size
IN Preposition/Subord. Conj. 195 299 242.79 104 24,279 19.71 22.14 0.09 12.14 12.78
NN Noun, Singular Or Mass 328 543 406.67 215 40,667 33.02 38.30 0.09 20.33 13.63
NNS Noun Plural 75 157 111.50 82 11,150 9.05 16.29 0.15 5.58 32.78
PRP Personal Pronoun 64 181 128.47 117 12,847 10.43 24.53 0.19 6.42 56.02
VB Verb Be, Base Form 66 156 107.99 90 10,799 8.77 22.09 0.20 5.40 64.30
VBD Verb Be, Past 39 105 64.40 66 6,440 5.23 15.45 0.24 3.22 88.40
VBG Verb Be, Gerund/Participle 34 76 51.03 42 5,103 4.14 8.16 0.16 2.55 39.30
VBN Verb Be, Past Participle 22 63 43.67 41 4,367 3.55 10.59 0.24 2.18 90.43
VBZ Verb Be, Pres, 3Rd P. Sing 41 85 63.03 44 6,303 5.12 9.81 0.16 3.15 37.22
WP Wh-Pronoun 4 20 12.11 16 1,211 0.98 3.40 0.28 0.61 121.48

Prepositions, nouns, and verbs were not only the most frequent tags per 2,000 words, but also had the greatest range of variation among the samples, nouns being the most varied of the tags studied. Wh-pronouns were quite rare, accounting for less than 1% of the part-of-speech tags.

2.4.2.1 Sampling Unit

According to Biber (1993), each sampling unit should represent the range of linguistic characteristics in the text. To determine sampling unit size, the corpus was split into 50 pairs of 100-, 500-, 1000-, and 2000-word texts. Chi-squared tests compared the distributions of the 10 selected lexical features between the pairs of texts. If large differences were found between the pairs of samples, the sample size did not adequately represent the overall linguistic characteristics of the text. Table 7 summarizes the chi-square p-values for the distribution of lexical features between texts.

Table 7: Chi-Square p-values for distribution of lexical features between texts

Register Size IN NN NNS PRP VB VBD VBG VBN VBZ WP Mean
Blogs 2,000 0.9321 0.0185 0.0766 0.0016 0.3583 0.0249 0.6309 0.4612 0.0415 0.1323 0.2678
News 2,000 0.8321 0.0235 0.7245 0.0058 0.2708 0.0072 0.0368 0.1390 0.3567 0.6374 0.3034
Twitter 2,000 0.4761 0.5872 0.5779 0.0944 0.7261 0.7935 0.6987 0.2784 0.5452 0.5607 0.5338
Corpus Mean 2,000 0.7468 0.2097 0.4597 0.0339 0.4517 0.2752 0.4554 0.2929 0.3145 0.4435 0.3683

The average p-value across all 2,000-word samples and POS tags was 0.37. At the corpus level, the p-values were above \(\alpha = 0.05\), indicating that the samples came from similar distributions. Despite the dispersion of nouns, prepositions, and verbs in the news register, which suggested that a larger sampling unit might be necessary, the mean p-value for that register across all tags was 0.3034. Thus, the null hypothesis of similar distributions was not rejected, and a sampling unit size of 2,000 tokens was adopted.

2.4.2.2 Total Corpus Size

Total corpus size is based upon the distribution characteristics of each linguistic feature. More precisely, there are \(n_i\) different estimates of total corpus size, one for each of the \(i\) lexical features. Once a calculation is made for each feature, the largest \(n_i\) is selected as the estimate. The per-feature sample size is given by the following equation: \[n_i = \frac{s_i^2}{\left(\frac{te_i}{t}\right)^2}\] where \(n_i\) is the computed sample size, in terms of contiguous sampling units, associated with the \(i\)th lexical feature, \(s_i\) is the estimated standard deviation of the feature in the population, \(te_i\) is the tolerable error (equal to 1/2 of the desired 95% confidence interval), and \(t\) is the \(t\)-value for the desired probability level (1.79588482 for \(\alpha\) = 0.05 with 11 degrees of freedom) (Biber 1993).
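Using the IN row of Table 6 as a worked example (an illustrative Python check, not project code): note that the tabulated sample sizes appear to have been computed with the two-tailed critical value \(t \approx 1.96\) rather than the quoted 1.79588482, so that value is used here.

```python
def feature_sample_size(sd, tolerable_error, t=1.96):
    """n_i = s^2 / (te / t)^2: the number of 2,000-word sampling
    units needed to estimate feature i within the tolerable error."""
    return sd ** 2 / (tolerable_error / t) ** 2

# IN (prepositions): sd = 22.14, tolerable error = 12.14 (Table 6).
print(round(feature_sample_size(22.14, 12.14), 2))  # 12.78
```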

The corpus sample size, in terms of 2,000-word samples (Table 6), is 121, or 242,000 tokens. Rare lexical features (those accounting for less than 5% of the total) were not considered. Next, the proportional representation of each register is determined.

2.4.2.3 Register Size

The sample size for each register comprised a base component - a minimum number of texts allocated to each register - plus a proportional component based upon the degree of variation in each register. The formula for calculating register sample size is: \[n_r = (t\cdot b) + \lambda \cdot avc_r\] where \(r\) is the register, \(t\) is the total corpus size, 121, and \(b = 0.1\) is the base allocation, the fraction of the total corpus allocated to each register. The scaling factor \(\lambda\) is: \[\lambda = \frac{t - \displaystyle\sum_{i=1}^{r} t \cdot b}{\displaystyle\sum_{i=1}^{r} avc_r}\] The numerator is the proportional allocation: the total corpus size minus the base for each register. The average variation coefficient \(avc_r\) for each register is calculated as follows: \[avc_r = \frac {1}{f} \cdot \displaystyle\sum_{i=1}^{f} \frac{sd_f}{\mu_f}\] where \(f\) is 10, the number of features studied, \(sd_f\) is the standard deviation for feature \(f\), and \(\mu_f\) is the mean distribution for feature \(f\). As such, the sample size allocated to each register is:

Table 9: Sample sizes for each register

Register Base Avg VC Lambda Proportion Num Samples Sample Length Sample Size
Blogs 12 0.14 208 30 42 2,000 84,000
News 12 0.14 208 29 41 2,000 82,000
Twitter 12 0.12 208 26 38 2,000 76,000
Corpus 36 0.41 208 85 121 2,000 242,000
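The allocation in Table 9 can be traced numerically (an illustrative Python sketch using the rounded variation coefficients from the table; small differences from the tabulated 42/41/38 split are rounding effects):

```python
t, b = 121, 0.1                       # total corpus size, base fraction
avc = {"blogs": 0.14, "news": 0.14, "twitter": 0.12}

base = t * b                          # base component per register
lam = (t - len(avc) * base) / sum(avc.values())

samples = {r: base + lam * v for r, v in avc.items()}
for r, n in samples.items():
    print(r, round(n))                # blogs/news ~ 42 each, twitter ~ 38

# The register allocations exhaust the total corpus size.
assert abs(sum(samples.values()) - t) < 1e-9
```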

2.4.3 Sampling Strategy

The following summarizes the analyses and outlines the design for the training, validation and test sets.

Table 10: Corpus analyses and design. Sample sizes are in tokens

Register Diversity-Based Estimate Lexical Feature-Based Estimate Training Set Validation Set Test Set Total
Blogs 1,113,314 84,000 890,651 111,331 111,331 1,113,314
News 391,309 82,000 313,047 39,131 39,131 391,309
Twitter 880,644 76,000 704,515 88,064 88,064 880,644
Corpus 2,385,267 242,000 1,908,213 238,526 238,526 2,385,267

The lexical diversity-based estimate of 2,385,267 tokens was significantly greater than that of the lexical feature analysis. Taking a conservative approach, the lexical diversity-based estimate was selected and split among the training (80%), validation (10%), and test (10%) sets. Since the modeling would be based upon the training set, the model corpus size was increased by 25% so that the training corpus would be representative of the HC Corpus. Table 11 outlines the descriptive statistics for the training corpus. Note: the small differences in size were due to varying average sentence lengths across the registers.

Table 11: Descriptive statistics for the model corpus

Genre Size (MB) Sentences Tokens Words Types Mean Sentence Length Mean Word Length
Blogs 14 85,221 1,358,510 1,358,510 53,554 16 4
News 6 29,988 520,308 520,308 33,223 17 5
Twitter 6 67,473 532,910 532,910 30,715 8 4
Corpus 25 182,682 2,411,728 2,411,728 74,680 13 4

2.4.4 Sampling Strategy Verification

In order to test whether the model corpus, hereinafter the training corpus, was truly representative, two tests were conducted. The first checked the number of OOV words in the training corpus. More concretely, the number of words in the cleaned HC Corpus that were not in the training corpus was normalized by the HC Corpus vocabulary size to obtain an OOV rate. Next, the distributions of lexical features in the two corpora were compared for similarity using a chi-squared test.

2.4.4.1 Vocabulary Verification

The following table summarizes the vocabulary and token OOV rates for the training set.

Table 12: Training corpus vocabulary OOV Rates

Corpus HC Corpus Vocabulary HC Corpus Tokens Training Corpus Vocabulary Training Corpus Tokens # OOV Words OOV Rate Coverage
Blogs 280,731 37,110,484 54,225 1,358,510 779,021 2 98
News 77,534 2,608,731 33,741 520,308 82,981 3 97
Twitter 264,089 29,354,830 31,108 532,910 928,837 3 97
Corpus 622,354 69,074,045 119,074 2,411,728 1,790,839 3 97

This analysis illustrates a feature of a “Zipfian” distribution. Although the training corpus contained only 19% of the HC Corpus word types (a vocabulary OOV rate of 81%), it covered 97% of the tokens in the HC Corpus. Indeed, a sample comprising just 3% of the corpus tokens provided 97% coverage.
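These rates follow directly from the corpus-level row of Table 12 (an illustrative Python check of the arithmetic):

```python
hc_tokens, hc_types = 69_074_045, 622_354
train_tokens, train_types = 2_411_728, 119_074
oov = 1_790_839  # HC Corpus tokens not covered by the training vocabulary

token_coverage = 100 * (1 - oov / hc_tokens)
type_coverage = 100 * train_types / hc_types
sample_share = 100 * train_tokens / hc_tokens

print(round(token_coverage))  # 97% of HC Corpus tokens covered
print(round(type_coverage))   # with only 19% of its word types
print(round(sample_share))    # from a 3% token sample
```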

2.4.4.2 Lexical Feature Verification

Chi-squared tests were conducted on the distributions of POS tags in the HC and training corpora. The null hypothesis was that no statistically significant differences existed between the distributions of lexical features in the HC and training corpora. The hypothesis would be rejected if differences at least as large as those observed had a probability of five percent or less under the null.

Table 13: Lexical Feature Distribution HC Corpora vis-a-vis Training Set (p-Value = 0.2313417)

| Tag | Description                     | HC Corpus | Training Corpus |
|-----|---------------------------------|----------:|----------------:|
| IN  | Preposition/Subord. Conj.       | 242.79    | 248.14 |
| NN  | Noun, Singular or Mass          | 406.67    | 409.68 |
| NNS | Noun, Plural                    | 111.50    | 118.35 |
| VB  | Verb, Base Form                 | 128.47    | 116.81 |
| VBD | Verb, Past Tense                | 107.99    | 99.97  |
| VBG | Verb, Gerund/Present Participle | 64.40     | 69.93  |
| VBN | Verb, Past Participle           | 51.03     | 48.32  |
| VBZ | Verb, Present, 3rd P. Sing.     | 43.67     | 46.87  |
| PRP | Personal Pronoun                | 63.03     | 63.23  |
| WP  | Wh-Pronoun                      | 12.11     | 12.85  |

The p-value of 0.23 is well above the 0.05 significance level, so the null hypothesis was not rejected; the data are consistent with the training corpus representing the range of lexical variation of the HC Corpus.
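A minimal, pure-Python sketch of the Pearson chi-squared statistic used in this comparison follows; it is illustrative only (the project's analysis was done in R) and does not reproduce the exact test that produced the reported p-value. The expected distribution is rescaled to the observed total so the two are compared on the same scale:

```python
def chi_squared_stat(observed, expected):
    """Pearson chi-squared statistic for two frequency distributions.
    `expected` is rescaled to the total of `observed`."""
    scale = sum(observed) / sum(expected)
    return sum((o - e * scale) ** 2 / (e * scale)
               for o, e in zip(observed, expected))

# per-2,000-word POS tag frequencies from Table 13
hc  = [242.79, 406.67, 111.50, 128.47, 107.99, 64.40, 51.03, 43.67, 63.03, 12.11]
trn = [248.14, 409.68, 118.35, 116.81, 99.97, 69.93, 48.32, 46.87, 63.23, 12.85]
stat = chi_squared_stat(trn, hc)
# with 10 tags, df = 9; the 0.05 critical value is 16.919, so a
# statistic below that threshold fails to reject the null hypothesis
```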

2.4.5 Summary of Sampling Strategy

To sum up, two analyses were conducted. In the lexical diversity analysis, linear models based upon the Zipf-Mandelbrot law provided sample size estimates based upon the distribution of word frequencies. The lexical feature analysis determined sample size, in terms of chunk size and number of chunks, for each register based upon the variation in the distributions of lexical features. Taking the larger of the two estimates for the training corpus, tests were conducted to ensure that the training corpus was indeed representative of the HC Corpus. Coverage analysis showed that over 95% of the tokens in the HC Corpus could be covered with roughly 3 percent of its text. Chi-squared comparisons of the lexical feature distributions showed that the training corpus captured the range and distribution of lexical features of the HC Corpus.

The next section summarizes the exploratory data analysis of the training set.

2.5 Exploratory Data Analysis

The qualitative and quantitative exploratory data analysis (EDA) summarized in this section was focused on identifying features in the training set that could be exploited in the language model. Starting with the descriptive statistics, the report examines word n-gram frequencies, lexical feature distributions, and POS n-gram occurrences.

2.5.1 Descriptive Statistics

The following table outlines the descriptive statistics for the training corpus.

Table 14: Training Set Descriptive Statistics

| Genre   | Size (MB) | Sentences | Tokens | Words | Types | Mean Sentence Length | Mean Word Length |
|---------|----------:|----------:|-------:|------:|------:|---------------------:|-----------------:|
| Blogs   | 14 | 85,221  | 1,358,510 | 1,358,510 | 53,554 | 16 | 4 |
| News    | 6  | 29,988  | 520,308   | 520,308   | 33,223 | 17 | 5 |
| Twitter | 6  | 67,473  | 532,910   | 532,910   | 30,715 | 8  | 4 |
| Corpus  | 25 | 182,682 | 2,411,728 | 2,411,728 | 74,680 | 13 | 4 |

The key finding of this analysis was that the corpus could be reduced in size by 97% while retaining the lexical diversity and feature distribution of the HC Corpus.

2.5.2 Lexical Diversity Analysis

Lexical diversity, or richness, is a measure of the variety of words used in a text. Several measures of lexical richness were calculated and plotted cumulatively over samples from each register: Yule's K (Yule 1944), the Zipf slope (Zipf 1935), the type-token ratio, Herdan's C (Herdan 1960), Guiraud's R (Guiraud 1954), the mean log frequency (Carroll 1967), the Mean Segmental Type-Token Ratio (Johnson 1944), and the vocabulary growth rate, i.e., the ratio of hapax legomena to tokens sampled (Baayen 2008). Because these measures are sensitive to sample size, they were also computed on single samples of 520,300 tokens taken from each register.

Figure 11: Training Corpus Lexical Diversity Analysis

The curvilinear distribution of word types (upper left) shows that vocabulary growth is relatively stable within text segments, but the occurrences of new types decrease throughout the course of the text. The vocabulary growth rate is the probability of encountering an unseen type after having read \(N\) tokens, and is estimated by the ratio of the number of hapax legomena (types of frequency 1) to the number of tokens, \(N\). The type-token ratio decreased rapidly as the number of tokens read approached approximately 100,000, then decreased at a slower rate thereafter. The mean log frequency is the average frequency of all words that appear up to \(N\) tokens read. Herdan's C is the ratio of the log of the vocabulary size \(V\) to the log of the number of tokens \(N\) (Herdan 1960). Guiraud's R is the ratio of the vocabulary size \(V\) to the square root of the number of tokens \(N\). Yule's K, often used for author attribution of literary texts, is a measure of lexical repetition that is independent of text size. Finally, Zipf's slope (Baayen 2013) is the slope of Zipf's rank-frequency curve in double-logarithmic space.
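Several of these measures follow directly from the type and frequency counts of a sample. A minimal Python sketch (illustrative only; the project used R packages such as languageR and zipfR for this analysis), with standard formulations of the named measures:

```python
import math
from collections import Counter

def diversity_measures(tokens):
    """A few of the size-sensitive lexical diversity measures named above."""
    counts = Counter(tokens)
    N, V = len(tokens), len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    freq_of_freq = Counter(counts.values())  # V_m: number of types with frequency m
    return {
        "ttr": V / N,                        # type-token ratio
        "herdan_c": math.log(V) / math.log(N),
        "guiraud_r": V / math.sqrt(N),
        "growth_rate": hapax / N,            # hapax legomena / tokens
        # Yule's K: lexical repetition, conventionally scaled by 10^4
        "yule_k": 1e4 * (sum(m * m * vm for m, vm in freq_of_freq.items()) - N) / (N * N),
    }

m = diversity_measures("the cat sat on the mat with the cat".split())
```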

Table 15: Lexical Diversity Measures at \(N\) = 520,300 tokens

| Category | Type-Token Ratio | Lognormal | Herdan | Guiraud | Yule  | Zipf  |
|----------|-----------------:|----------:|-------:|--------:|------:|------:|
| Blogs    | 0.06 | 0.91 | 0.59 | 46.32 | 76.88 | -1.04 |
| News     | 0.06 | 1.02 | 0.54 | 46.78 | 75.57 | -0.96 |
| Twitter  | 0.06 | 0.87 | 0.58 | 42.60 | 55.40 | -1.06 |

Table 15 summarizes the relative lexical diversity measures for 520,300 tokens read from each register.

Consistent with the a priori supposition, the news register appeared to have the greatest degree of lexical diversity across all measures, whereas the twitter register had the least. The key finding is that the lexical diversity-based register sample size estimates were consistent with the register lexical diversity measures in this analysis: the greater the lexical diversity of a register, the greater the sampling proportion required to be representative.

2.5.3 Lexical Density Analysis

Lexical density, the ratio of content words (nouns, adjectives, verbs, adverbs) to all words in a text, is an estimated measure of content within a text. While N-Gram models work well with function words, they tend to underperform in content word prediction. Incorporating semantic information contained in content words can improve overall prediction accuracy. Lexical density measures can indicate the degree to which semantic analysis can improve overall prediction accuracy.

Table 16: Lexical Density Analysis of Training Corpus

| Register | Tokens | Content Words | Density (%) |
|----------|-------:|--------------:|------------:|
| Blogs    | 1,409,168 | 543,937   | 38.6 |
| News     | 537,537   | 237,139   | 44.1 |
| Twitter  | 560,945   | 229,021   | 40.8 |
| Corpus   | 2,507,650 | 1,010,097 | 40.3 |

Overall density for the corpus was 40%, with the news register leading the group at 44%. The blogs register, somewhat surprisingly, had the lowest percentage of content words. The key finding, in any case, was the seemingly low level of lexical density in the corpus. The potential impact of semantic analysis on prediction accuracy remained an open question at this stage of the analysis.

2.5.4 Lexical Feature Analysis

The following plot depicts the distribution of select lexical features commonly used in variation studies.

Figure 12: Lexical Feature Distributions (per 2,000 words)

As found in the master corpus, singular nouns were most common, followed by personal pronouns, plural nouns, adjectives, and adverbs. WH-clauses were rare and highly varied throughout the corpus. The key finding was that the range and distribution of lexical features across registers was a positive indication for an integrated word and POS-based n-gram prediction model.

2.5.5 Word N-Gram Frequency Distribution Analysis

Here, the frequency distribution of word n-grams was evaluated against a Zipfian distribution. This analysis bears on the validity of Zipf-based inferences in downstream language modeling tasks, such as language model pruning. Zipf plots reveal the extent to which the n-gram frequencies follow a Zipfian distribution.

Figure 13: N-Gram Zipf Plots

The key finding, as shown in Figure 13, was that the visual goodness-of-fit check revealed systematic deviations from a Zipfian distribution. Any downstream decisions based on word frequency would therefore require an LNRE model, such as the Zipf-Mandelbrot model (Mandelbrot 1961) used during the corpus design phase.
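A Zipf plot's slope can be estimated by a least-squares fit of log frequency against log rank. The sketch below is a minimal Python illustration of that fit (the project used R's zipfR tooling); for exactly Zipfian data with frequency proportional to 1/rank, the slope is -1:

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) on log(rank) for a token sample."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# frequencies 12, 6, 4, 3 are proportional to 1/rank, so the slope is -1
slope = zipf_slope(["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3)
```

Systematic curvature of the residuals around such a fit is the visual signal of deviation from a Zipfian distribution noted above.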

2.5.6 Word N-Gram Coverage Analysis

This section examines the proportion of all n-gram tokens covered as distinct n-grams are introduced in descending order of frequency. Again, this analysis would inform any language model pruning required downstream.

Figure 14: Word N-Gram Coverage

Each plot is annotated with the number of n-grams required to achieve 50%, 75%, and 95% coverage of all n-grams in the training corpus. Despite the deviations from the Zipf distribution, the unigram coverage data behaved in a rather Zipfian manner. The coverage curves for the bigrams, trigrams, and quadgrams, on the other hand, did not: the trigram and quadgram coverage plots are nearly linear. The key finding of this analysis was that any pruning of the higher-order n-grams would have a significant and detrimental effect on coverage, and thus on prediction accuracy.
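The coverage annotations amount to a cumulative sum over n-gram counts in descending frequency order. A minimal Python sketch with hypothetical bigram counts (illustrative only, not the project's data):

```python
from collections import Counter

def ngrams_for_coverage(counts, target):
    """Number of distinct n-grams, taken in descending frequency order,
    needed to cover `target` proportion of all n-gram tokens."""
    total = sum(counts.values())
    covered = 0
    for i, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= target:
            return i
    return len(counts)

# hypothetical bigram counts (100 tokens total)
counts = Counter({"of the": 50, "in the": 25, "to the": 15,
                  "on a": 5, "at a": 3, "by a": 2})
n50 = ngrams_for_coverage(counts, 0.50)
n75 = ngrams_for_coverage(counts, 0.75)
n95 = ngrams_for_coverage(counts, 0.95)
```

When the curve is nearly linear, as with the trigrams and quadgrams, almost every distinct n-gram is needed to reach high coverage, which is why pruning is so damaging there.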

2.5.7 Word N-Gram Feature Analysis

The training corpus comprised some 76,172 unigrams, 783,974 bigrams, 1,517,752 trigrams, and 1,718,465 quadgrams. The following figures show the top 50 features by frequency for each n-gram order.

Figure 15: 50 Most frequent of \(N\) = 76,172 unigrams

Figure 16: 50 Most frequent of \(N\) = 783,974 bigrams

Figure 17: 50 Most frequent of \(N\) = 1,517,752 trigrams

Figure 18: 50 Most frequent of \(N\) = 1,718,465 quadgrams

This analysis reveals the degree to which lexical density increases with each higher n-gram order. The most frequent unigrams consisted almost exclusively of function words, and the proportion of content words in the top n-gram features increased with each higher order. The key finding of this analysis was that content increases with context.
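Extracting the top n-gram features is a straightforward sliding-window count. A minimal Python sketch (illustrative; the project's feature extraction was done with quanteda in R):

```python
from collections import Counter

def top_ngrams(tokens, n, k):
    """The k most frequent n-grams of a token sequence."""
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding window of width n
    return Counter(" ".join(g) for g in grams).most_common(k)

tokens = "the end of the day at the end of the week".split()
top = top_ngrams(tokens, 2, 3)  # three most frequent bigrams
```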

2.5.8 Syntactic (POS) N-Gram Analysis

The training corpus comprised some 40 POS unigrams, 1,110 POS bigrams, and 14,870 POS trigrams. The following figures show the top features by frequency for each n-gram order.

Figure 19: 40 Most frequent of \(N\) = 40 unigrams

Figure 20: 50 Most frequent of \(N\) = 1,110 bigrams

Figure 21: 50 Most frequent of \(N\) = 14,870 trigrams

This analysis confirms some aspects of the lexical feature analysis, but goes further. The key finding of this analysis was that the bigram and trigram co-occurrence information intimates the value of an integrated word and POS-based n-gram model.

2.5.9 Exploratory Data Analysis (EDA) Summary

Eight analyses were undertaken to reveal linguistic characteristics of the corpus that could be exploited during the language modeling phase. The following summarizes the key findings from each analysis.

  1. Descriptive Statistics - outlined the basics in terms of numbers of sentences, words, and word types.
  2. Lexical Diversity Analysis - confirmed that the greater the lexical diversity of a register, the greater the sampling proportion required to be representative.
  3. Lexical Density Analysis - the seemingly low lexical density estimates called into question the value of an integrated semantic prediction model.
  4. Lexical Feature Analysis - the range and distribution of lexical features across registers was a positive indication for an integrated word and POS-based n-gram prediction model.
  5. Word N-Gram Frequency Distribution Analysis - the distribution of n-grams did not follow a Zipfian distribution.
  6. Word N-Gram Coverage Analysis - any pruning of the higher-order n-grams would have a significant and detrimental effect on coverage and prediction accuracy.
  7. Word N-Gram Feature Analysis - the increased lexical density of higher-order n-grams showed that content increases with context.
  8. Syntactic (POS) N-Gram Analysis - the bigram and trigram co-occurrence information intimates the value of an integrated word and POS-based n-gram model.

3 Next Steps

With the analysis phase concluded, the modeling phase will include corpus transformation, predictive modeling, tuning, and model integration. Specifically, the following tasks will be undertaken:

  1. Preprocess the training, validation, and test corpora for out-of-vocabulary (OOV) words. All hapax legomena will be replaced with ‘UNK’, the unknown pseudoword.
  2. Process the corpora to include sentence boundary annotations.
  3. Complete the Modified Kneser-Ney Probabilistic 4-Gram Model
  4. Complete and integrate the Syntactic Language Model
  5. Complete and integrate the Semantic Language Model
  6. Tune model interpolation factors
  7. Conduct model evaluation experiments
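The hapax-to-UNK preprocessing in step 1 can be sketched as follows; this is a minimal Python illustration (the project's pipeline is R-based), not the production preprocessing code:

```python
from collections import Counter

UNK = "UNK"  # the unknown pseudoword described in step 1

def replace_hapax(tokens):
    """Replace every hapax legomenon (frequency-1 type) with UNK, so the
    model learns probability estimates for out-of-vocabulary words."""
    counts = Counter(tokens)
    return [t if counts[t] > 1 else UNK for t in tokens]

out = replace_hapax("the cat sat on the mat".split())
# only "the" occurs more than once, so every other token becomes UNK
```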

Lastly, the development phase will include the design, implementation, and roll-out of the prediction system and the production of the project final slide deck.

4 References

Baayen, R. H. 2013. “Package ‘languageR’: Data Sets and Functions for ‘Analyzing Linguistic Data: A Practical Introduction to Statistics’.”

Baayen, R. H. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Benoit, Kenneth, Paul Nulty, Kohei Watanabe, Benjamin Lauderdale, Adam Obeng, Pablo Barberá, and Will Lowe. 2016. “quanteda: Quantitative Analysis of Textual Data.” https://github.com/kbenoit/quanteda.

Biber, Douglas. 1993. “Representativeness in Corpus Design.”

Carroll, J.B. 1967. “On sampling from a lognormal model of frequency distribution.” In Computational Analysis of Present-Day American English, edited by H Kucera and W.N. Francis, 406–24. Providence: Brown University Press.

Christensen, Hans. 2016. “HC Corpora.” http://www.corpora.heliohost.org/.

Cool-Smileys. 2010. “List of Text Emoticons: The Ultimate Resource.” http://cool-smileys.com/text-emoticons.

Dowle, Matt. n.d. “Package ’data.table’.” https://cran.r-project.org/web/packages/data.table/data.table.pdf.

Duggan, Maeve. 2015. “Mobile Messaging and Social Media 2015 | Pew Research Center.” http://www.pewinternet.org/2015/08/19/mobile-messaging-and-social-media-2015/.

Evert, Stefan. 2015. “Package ‘ZipfR’.” R Package, 94.

Evert, Stefan, and Marco Baroni. 2005. “Testing the extrapolation quality of word frequency models.” Proceedings of Corpus Linguistics 2005, 1747–1939. http://www.stefan-evert.de/PUB/EvertBaroni2005.pdf.

Google. 2012. “Offensive Words from Google’s ’What Do You Love’ Project.” https://gist.github.com/jamiew/1112488.

Guiraud, Pierre. 1954. Les caractères statistiques du vocabulaire.

Herdan, Gustav. 1960. Type-token Mathematics. Mouton.

Johnson, W. 1944. “Studies in Language Behavior: I. A Program of Research.” 56 (2): 1–15.

Kennon, Joshua. 2015. “Blog Demographics 2015 Edition: If Life Were a Game, You All Would Be Champions.” http://www.joshuakennon.com/blog-demographics-2015-edition-if-life-were-a-game-you-would-be-champions/.

Khmaladze, E. V. 1988. “The statistical analysis of a large number of rare events.” Department of Mathematical Statistics, no. R 8804 (January). CWI: 1–21. http://oai.cwi.nl/oai/asset/5988/5988A.pdf.

Kottmann, Jörn, Grant Ingersoll, Isabel Drost, James Kosin, Jason Baldridge, Thomas Morton, William Silva, et al. 2016. “openNLP: Apache OpenNLP Tools Interface.” https://opennlp.apache.org/.

Mandelbrot, B B. 1961. “On the theory of word frequencies and on related Markovian models of discourse.” In Structures of Language and Its Mathematical Aspects, edited by R Jacobsen. New York: American Mathematical Society.

Moreno-Sanchez, Isabel, Francesc Font-Clos, and Alvaro Corral. 2016. “Large-scale analysis of Zipf’s law in English texts.” PLoS ONE 11 (1). doi:10.1371/journal.pone.0147073.

Rstudio Team. 2016. “RStudio – Open source and enterprise-ready professional software for R.” https://www.rstudio.com/.

The R Foundation. 2015. “What is R?” doi:10.1016/B978-0-12-417113-8.09994-X.

Wikipedia. 2016. “Wikipedia:List of English contractions.” http://www.oxforddictionaries.com/us/definition/english/amn't.

Wiktionary. 2016. “Appendix:English internet slang - Wiktionary.” https://en.wiktionary.org/wiki/Appendix:English_internet_slang.

Yule, George Udny. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press.

Zipf, George K. 1936. The Psychobiology of Language: An Introduction to Dynamic Philology. London: Routledge.

Zipf, George Kingsley. 1935. The Psycho-Biology of Language: An Introduction to Dynamic Philology. Boston: Houghton Mifflin Company.